ASSIGNMENT OF SEMANTIC TAGS TO PHRASES FOR GRAMMAR GENERATION



The present invention relates to the field of automated language
understanding for dialogue applications.

Automatic dialogue systems and telephone based machine enquiry
systems are now widely used for providing information, e.g. train or flight
timetables, or for receiving enquiries from a user, e.g. bank transactions or travel
bookings. The crucial task of an automatic dialogue system is to extract the
information it needs from a user input, which is typically provided by speech.

The extraction of information from speech can be divided into two
steps: speech recognition on the one hand and mapping of the recognized speech to
semantic meanings on the other hand. The speech recognition step transforms
the speech received from a user into a form that can be machine
processed. It is then essential that the recognized speech is interpreted
by the automatic dialogue system in the correct way. Therefore, an assignment or
mapping of recognized speech to a semantic meaning has to be performed by the
automatic dialogue system. For example, when a train timetable dialogue system
receives the enquiry "I need a connection from Hamburg to Munich", the two cities
"Hamburg" and "Munich" have to be properly identified as origin and destination
of the train travel.

Essential fragments of the above sentence, "from Hamburg" and "to
Munich", have to be extracted and understood by the automatic dialogue system to
the extent that the phrase "from Hamburg" is mapped to the origin semantic tag
whereas the phrase "to Munich" is mapped to the destination semantic tag. When all
semantic tags like origin, destination, time, date, or other travel specifications are
mapped to phrases of the user enquiry, the dialogue system can perform the required
action.
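
The mapping described above can be illustrated with a minimal Python sketch. The
table PHRASE_TAGS and the function tag_phrases are hypothetical names introduced
only for illustration; a real dialogue system would derive the mapping from a
grammar rather than from a fixed lookup table.

```python
# Hypothetical phrase-to-tag table for the train timetable example.
PHRASE_TAGS = {
    "from Hamburg": "origin",
    "to Munich": "destination",
}

def tag_phrases(sentence, phrase_tags=PHRASE_TAGS):
    """Return (phrase, tag) pairs for every known phrase found in the sentence."""
    return [(phrase, tag) for phrase, tag in phrase_tags.items() if phrase in sentence]

result = tag_phrases("I need a connection from Hamburg to Munich")
print(result)
```

Once every required tag has a phrase assigned, the system has what it needs to
perform the requested action.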

The assignment or mapping of recognized phrases to semantic tags is
typically provided by some kind of grammar. A grammar contains rules defining the
mapping of semantic tags to phrases. Such rule based grammars have been the most
investigated subject of research in the field of natural language understanding and are
often incorporated in actual dialogue systems. An example of an automatic dialogue
system as well as a general description of automatic dialogue systems is given in the
paper "H. Aust, M. Oerder, F. Seide, V. Steinbiss; The Philips Automatic Train
Timetable Information System, Speech Communication 17 (1995) 249-262".

Since an automatic dialogue system is typically designed for a distinct
purpose, e.g. a timetable information or an enquiry processing system, the underlying
grammar is individually designed for that purpose. Most of the grammars
known in the prior art are manually written, in the sense that the rules constituting the
grammar cover a huge set of phrases and various combinations of phrases that may
appear within a dialogue.

In order to perform a mapping between a phrase and a semantic tag, the
phrase or the combination of phrases has to match at least one of the rules of the
manually written grammar. The generation of such a hand written grammar is an
extremely time consuming and resource intensive process, since every possible
combination of phrases or variation of a dialogue has to be explicitly taken into
account by means of individual rules. Furthermore, a manually created grammar is
always subject to maintenance, because the underlying set of rules may not cover all
types of dialogues and types of phrases that typically occur during operation of the
automatic dialogue system.

In general, grammars for automatic dialogue systems are application
related, which means that a distinct grammar is always designed for a distinct type of
automatic dialogue system. Therefore, for each type of automatic dialogue system a
special grammar has to be manually constructed. It is clear that the generation of such a
multiplicity of different grammars represents a considerable cost factor which should be
minimized.

In order to reduce the rather costly manual effort for
generation, maintenance and adaptation of grammars, methods for an automatic
generation of grammars or automatic learning of grammars have been introduced
recently. An automatic construction of a grammar is typically based on a corpus of
weakly annotated training sentences. Such a training corpus can for example be derived
by logging the dialogue of an existing application. However, automatic learning
further requires a set of annotations indicating which phrases of the training corpus are
assigned to which known tag. Typically, this annotation has to be performed manually,
but it is in general less time consuming than the generation of an entire grammar.

The paper "K. Macherey, F. J. Och and H. Ney; Natural Language
Understanding using Statistical Machine Translation", presented at the 7th European
Conference on Speech Communication and Technology, Aalborg, Denmark, September
2001, which is also available from the URL
"http://wasserstoff.informatik.rwth-aachen.de/Colleagues/och/eurospeech2001.ps",
describes the automatic learning of a grammar.

In fact the document discloses an approach to natural language
understanding which is derived from the field of statistical machine translation. The
problem of natural language understanding is described as a translation from a source
sentence to a formal language target sentence. This method therefore aims to reduce the
use of grammars in favour of automatically learning dependencies between words and
their meaning. To this extent the mentioned method deals with a
translation problem rather than with the automatic generation of a grammar.

In contrast to that, the US patent application US 2003/0061024 A1
explicitly concentrates on the learning of a grammar. This method is based on
determining sequences of terminals, or of terminals and wild cards, linked to non
terminals of a grammar in a training corpus of sentences. After sequences of terminals
or of terminals and wild cards have been determined, they are assigned to a non terminal
or to no non terminal by means of a classification procedure. This classification in turn
uses an exchange procedure which is based on an exchange algorithm. The exchange
algorithm guarantees an efficient optimization of a target function which takes account
of all incorrect classifications and which is iteratively optimized in the classification of
the sequences of terminals or of terminals and wild cards. Thereby the order of the non
terminals in the training sentences does not have to be annotated manually, since the
target function uses only the information as to which sequences of terminals or of
terminals and wild cards and which non terminals are present in the training sentences.
Furthermore, the exchange procedure guarantees an efficient (local) optimization of the
target function since only a few operations are necessary for calculating the change in
the target function upon the execution of an exchange.

The present invention aims to provide another method for mapping
semantic tags to phrases, thereby providing for the generation of a grammar for an
automatic dialogue system.

The invention provides an automatic learning of semantically useful
word phrases from weakly annotated corpus sentences. Thereby a probabilistic
dependency between word phrases and semantic concepts or semantic tags is estimated.
The probabilistic dependency describes the likelihood that a given phrase is mapped or
assigned to a distinct semantic tag. In this context a phrase is used as a generic term for
fragments of a sentence, a sequence of words or, in the minimal case, a single word.

The probabilistic dependency between phrases and tags is further
denoted as mapping probability, and its determination is based on the training corpus of
sentences. Initially, the method has no information about the annotation between tags
and phrases of the training corpus. In order to perform a calculation of the mapping
probability, a weak annotation between phrases and semantic tags must be
provided. Such a weak annotation can be realized for example by assigning a set of
candidate semantic tags to a phrase. Alternatively an IEL (inclusion/exclusion list) can
be used. An IEL is a list that includes or excludes various semantic tags that
can be mapped or must not be mapped to a phrase.
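
As an illustration of how an inclusion/exclusion list might narrow the candidate
tags of a phrase, consider the following Python sketch. The function
candidate_tags, its parameters and the tag inventory ALL_TAGS are assumptions made
for this example; the text does not prescribe a concrete data format for an IEL.

```python
def candidate_tags(all_tags, include=None, exclude=None):
    """Apply a (hypothetical) inclusion/exclusion list to the full tag inventory.

    include: tags that may be mapped to the phrase (None means "all tags").
    exclude: tags that must not be mapped to the phrase.
    """
    allowed = set(all_tags) if include is None else set(all_tags) & set(include)
    return allowed - set(exclude or ())

ALL_TAGS = {"origin", "destination", "time", "date"}

# Weak annotation for the phrase "from Hamburg": it cannot be a time or a date.
cands = candidate_tags(ALL_TAGS, exclude={"time", "date"})
print(sorted(cands))
```

The resulting candidate set is the weak annotation from which the mapping
probabilities are subsequently estimated.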

According to a preferred embodiment of the invention, for each phrase of
the training corpus an entire set of mapping probabilities between the phrase and the
corresponding set of candidate semantic tags is determined. In this way a probability
that a given phrase is assigned to a semantic tag is calculated for each possible
combination between the phrase and the entire set of candidate semantic tags, which
results in an automatic learning or generation of a grammar.

According to a further preferred embodiment of the invention, a semantic
tag is mapped to a phrase of the training corpus in accordance with the highest mapping
probability of the set of mapping probabilities. This means that the mapping or
assigning of a tag to a given phrase of the training corpus is determined by the highest
probability of the set of mapping probabilities for the given phrase.
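
Selecting the tag with the highest mapping probability amounts to a simple
maximum search, as the following sketch shows. The function name best_tag and the
probability values are invented for illustration and are not taken from any
experiment.

```python
def best_tag(mapping_probs):
    """Return the tag with the highest mapping probability for one phrase."""
    return max(mapping_probs, key=mapping_probs.get)

# Hypothetical set of mapping probabilities for the phrase "from Hamburg".
probs = {"origin": 0.7, "destination": 0.2, "time": 0.1}
chosen = best_tag(probs)
print(chosen)
```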

The method for mapping semantic tags to phrases therefore makes
explicit use of the determination of mapping probabilities. Such a mapping probability
can for example be determined from the given weak annotation between phrases and
semantic tags of the training corpus. Generally, there exists a plurality of probabilistic
means to generate such a mapping probability.

According to a further preferred embodiment of the invention, the
statistical procedure, hence the calculation of the mapping probabilities, is performed
by means of an expectation maximization (EM) algorithm. EM algorithms are commonly
known from forward backward training for Hidden Markov Models (HMM). A specific
implementation of the EM algorithm for the calculation of mapping probabilities is
given in the mathematical annex.

According to a further preferred embodiment of the invention, a
grammar can be derived from the performed mappings between a candidate semantic
tag and a phrase. Preferably the calculated and performed mappings are stored by some
kind of storing means in order to keep the computational effort low. Finally,
the derived grammar can be applied to new, unknown sentences.

The overall performance of the method of the invention can be enhanced
when the EM algorithm is applied iteratively. In this case the result of an iteration of
the EM algorithm is used as input for the next iteration. For example, an estimated
probability that a phrase is mapped to a tag is stored by some kind of storing means and
can then be reused in a subsequent application of the EM algorithm. In a similar way
the initial conditions in the form of weak annotations between phrases and tags or in the
form of an IEL can be modified according to mapping procedures previously performed
with the EM algorithm.

In order to test the efficiency and reliability of an EM based algorithm
for grammar learning, the EM based algorithm has been implemented by making use of
a so called Boston Restaurant Guide corpus. Experiments based on this implementation
demonstrate that an EM based procedure leads to better results than a procedure based
on an exchange algorithm as illustrated in US 2003/0061024 A1, especially
when large training corpora are used. Furthermore, it has been demonstrated that a
repeated application of the EM based procedure leads to continuous improvements of
the generated grammar. The tag error rate, which is defined as the ratio between the
number of falsely mapped tags and the total number of tags, decreases monotonically
as a function of the number of iterations. The main improvements of
the tag error rate are already reached after one or two iterations.
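
The tag error rate as defined above can be computed as in the following sketch.
The function name tag_error_rate and the example tag lists are hypothetical and
do not reproduce any figures from the experiments mentioned in the text.

```python
def tag_error_rate(predicted, reference):
    """Ratio between the number of falsely mapped tags and the total number of tags."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one tag is required")
    errors = sum(p != r for p, r in zip(predicted, reference))
    return errors / len(reference)

# Hypothetical example: one of four tags is mapped falsely.
rate = tag_error_rate(
    ["origin", "time", "date", "origin"],
    ["origin", "destination", "date", "origin"],
)
print(rate)
```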

In the following, preferred embodiments of the invention will be
described in greater detail by making reference to the drawings in which:



Fig. 1 is illustrative of a flow chart for the mapping of phrases and tags
by means of an EM based algorithm,
Fig. 2 shows a flow chart illustrating a dynamic programming
construction of a table L which is a subroutine for the EM algorithm,
Fig. 3 is illustrative of a flow chart describing the implementation of the
EM algorithm.




Figure 1 shows a flow chart for the mapping of semantic tags to phrases
based on the EM algorithm. In a first step 100, a phrase w is extracted from a training
corpus sentence. In the following step 102, a set of mapping probabilities p(k, w) is
calculated for each tag k from an unordered list of tags κ.

Once the set of mapping probabilities has been calculated for the phrase
w, the highest probability of the set of mapping probabilities p(k, w) is determined in
the following step 104. In the next step 106 the mapping between the phrase w and a
semantic tag k is performed. The phrase w is mapped to a single tag k according to
the highest probability p(k, w) of the set of mapping probabilities, which has been
determined in step 104. In this way the mapping between a semantic tag k and a phrase
w is performed by making use of a probabilistic estimation based on a training corpus.
The probabilistic estimation determines the likelihood that a semantic tag k is mapped
to a phrase w within the training corpus. When the mapping has been performed in
step 106, it is stored by some kind of storing means in step 108 in order to provide the
performed mapping for a subsequent application of the algorithm. In this way, the
procedure can be performed iteratively, leading to a decrease of the tag error rate and
thus to an enhancement of the reliability and efficiency of the entire grammar learning
procedure.

The calculation of the mapping probability which is performed in step
102 is based on the EM algorithm, which is explained in detail in the mathematical
annex by making reference to figure 2 and figure 3.

The calculation of the mapping probability according to the EM
algorithm is based on two additional probabilities denoted as L(i, κ') and R(i, κ'),
respectively, representing the probabilities for all permutations of an unordered tag
sublist κ' of length i − 1 over the left subsentence and of the unordered complement tag
sublist over the right subsentence of a training corpus sentence from position i + 1.

Figure 2 is illustrative of a flow chart for calculating the probability
L(i, κ').

In a first step 200, the initial probability for i = 0 is set to unity before, in
the next step 202, the index i of the tag sublist is initialized to i = 1. In the following
step 204, each sublist κ' of length i is selected from the unordered tag list κ. After
selecting each sublist the calculation procedure continues with step 206, in which the
probability L(i, κ') is set to zero. Then, in step 208, each tag k
from the unordered sublist is selected and successively provided to step
210, in which the permutation probability is calculated according to:

$$L(i, \kappa') = L(i, \kappa') + L(i-1, \kappa' \setminus \{k\}) \cdot p(k \mid w_i).$$

After the calculation of L(i, κ'), in step 212, the index i is compared to
|W|, the length of the phrase list W. If i is less than or equal to |W|, the procedure
returns to step 204 by incrementing index i by one. Otherwise, when i is larger than |W|,
the procedure for calculating the permutation probability ends with step 214.
Once the permutation probability has been calculated according to the
procedure described in figure 2, an analogous calculation is performed in order to obtain
the permutation probability R for the complement sublist of the right subsentence.
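
The dynamic programming construction of the table L can be sketched in Python as
follows. This is only an illustrative sketch under stated assumptions: the
function name build_left_table, the representation of a tag sublist as a frozenset
of tag positions, and the example probabilities P are not part of the described
method, and p(k, w) is assumed to be a callable returning p(k | w).

```python
from itertools import combinations

def build_left_table(tags, phrases, p):
    """Build L[(i, sublist)] by dynamic programming.

    L[(i, sublist)] sums, over all orderings of the tag sublist (of length i),
    the product of p(k_j | w_j) over the first i phrases. Tags are referred to
    by their position in `tags`, so repeated tags stay distinct.
    """
    L = {(0, frozenset()): 1.0}                      # step 200: L(0, {}) = 1
    for i in range(1, len(phrases) + 1):             # steps 202 and 212
        w_i = phrases[i - 1]
        for sub in combinations(range(len(tags)), i):  # step 204
            subset = frozenset(sub)
            total = 0.0                              # step 206
            for t in sub:                            # step 208
                # step 210: L(i, k') += L(i-1, k' \ {k}) * p(k | w_i)
                total += L[(i - 1, subset - {t})] * p(tags[t], w_i)
            L[(i, subset)] = total
    return L

# Hypothetical conditional probabilities p(k | w), for illustration only.
P = {
    ("origin", "from Hamburg"): 0.9, ("destination", "from Hamburg"): 0.1,
    ("origin", "to Munich"): 0.2, ("destination", "to Munich"): 0.8,
}
table = build_left_table(
    ["origin", "destination"], ["from Hamburg", "to Munich"],
    lambda k, w: P[(k, w)],
)
print(table[(2, frozenset({0, 1}))])
```

In this sketch the table R could be obtained by calling the same function with the
phrase list reversed, which matches the remark in the annex on computing R.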

Figure 3 finally illustrates the implementation of the EM algorithm for
calculating a mapping probability p(k, w) by making use of the above described
permutation probabilities.

In the first step 300, for all tags k and phrases w the probability
p(k | w) is initialized by setting q = 0 and setting q(k, w) = 0, before in step 302 one
of the training corpus sentences is selected. Since every sentence of the training corpus
is taken into account for the grammar learning, the following step 304 has to be applied
to all sentences of the training corpus.

After a sentence of the training corpus has been selected in step 302, it is
further processed in step 304, in which the steps 306, 308, 310, and 312 are
successively performed. In step 306, an unordered tag list κ as well as an ordered
phrase list W are selected. In the next step 308, the dynamic programming construction
of the table L is performed as described in figure 2. After that, a similar procedure is
performed with the reversed table R in step 310.

The calculated tables L and R as well as the initialized probabilities are
further processed in step 312. Step 312 can be interpreted as a nested loop with an
index i = 1, ..., i ≤ |W|. For each i, step 314 is performed, initializing another loop for
each of the unordered sublists κ' of length i − 1. For each unordered sublist the step 316
is performed, selecting each tag k ∉ κ' and performing the following calculation in step
318:

$$q' = L(i-1, \kappa') \cdot p(k \mid w_i) \cdot R(i+1, (\kappa \setminus \kappa') \setminus \{k\}),$$

where q' is further processed in step 320 according to:

$$q(k, w_i) = q(k, w_i) + q' \quad \text{and} \quad q = q + q'.$$

When the steps 318 and 320 have been executed for each tag k ∉ κ' in
step 316, when step 316 has been performed for each unordered sublist of length i − 1
in step 314, when step 314 has been performed for each index i ≤ |W| in step 312, and
when finally the entire procedure given by step 312 has been performed for each
sentence of the training corpus, then in step 322 the mapping probability is determined
according to:

$$p(k, w) = \frac{q(k, w)}{q} \quad \forall\, k, w.$$
Once the mapping probability has been determined, it is preferably
stored by some kind of storing means. For the purpose of grammar learning and for
mapping a tag to a given phrase, all probabilities of all possible combinations of phrases
and candidate semantic tags are calculated and stored. Finally, the mapping of a
semantic tag to a given phrase is performed according to the maximum probability of
all calculated probabilities for the given phrase.
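
One iteration of the procedure of figure 3 can be sketched in Python as below.
The sketch rests on several assumptions that are not prescribed by the text: the
names left_table and em_iteration, the corpus representation as a list of
(unordered tag list, ordered phrase list) pairs, tag sublists stored as frozensets
of tag positions, and a callable p(k, w) returning the current estimate of
p(k | w). The toy corpus is invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

def left_table(tags, phrases, p):
    """L[(i, S)]: summed probability of all orderings of tag subset S over phrases 1..i."""
    L = {(0, frozenset()): 1.0}
    for i in range(1, len(phrases) + 1):
        for sub in combinations(range(len(tags)), i):
            s = frozenset(sub)
            L[(i, s)] = sum(L[(i - 1, s - {t})] * p(tags[t], phrases[i - 1]) for t in sub)
    return L

def em_iteration(corpus, p):
    """Accumulate q(k, w) and q over the corpus and return q(k, w) / q."""
    q_kw = defaultdict(float)                     # step 300: q(k, w) = 0
    q = 0.0                                       # step 300: q = 0
    for tags, phrases in corpus:                  # steps 302/304
        s = len(phrases)
        if len(tags) != s:
            continue  # sentences with unequal numbers of tags and phrases are discarded
        everything = frozenset(range(s))
        L = left_table(tags, phrases, p)          # step 308
        R = left_table(tags, phrases[::-1], p)    # step 310: reversed phrase list
        for i in range(1, s + 1):                 # step 312
            for sub in combinations(range(s), i - 1):   # step 314
                left = frozenset(sub)
                for t in everything - left:       # step 316: tags not in the left sublist
                    right = everything - left - {t}
                    # step 318; the right part covers the last s - i phrases,
                    # i.e. the first s - i phrases of the reversed list
                    q_prime = L[(i - 1, left)] * p(tags[t], phrases[i - 1]) * R[(s - i, right)]
                    q_kw[(tags[t], phrases[i - 1])] += q_prime   # step 320
                    q += q_prime
    return {kw: value / q for kw, value in q_kw.items()}        # step 322

# Toy corpus: the second sentence disambiguates "from Hamburg".
corpus = [
    (["origin", "destination"], ["from Hamburg", "to Munich"]),
    (["origin"], ["from Hamburg"]),
]
new_probs = em_iteration(corpus, lambda k, w: 0.5)  # uniform starting point
print(new_probs[("origin", "from Hamburg")])
```

The sketch does not guard against q being zero, which would occur if no usable
sentence remains or all contributions vanish; a practical implementation would
need to handle that case.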

Based on the plurality of performed mappings, the grammar is finally
deduced and can be applied to other and hence unknown sentences that may occur in
the framework of an automated dialogue system.

Especially when the EM algorithm is repeatedly applied to a training 
corpus of sentences, the overall efficiency of the grammar learning procedure increases 
and the tag error rate decreases. 

MATHEMATICAL ANNEX

According to a preferred embodiment of the invention, the mapping
probability p(k, w) that a given phrase w is mapped to a semantic tag k is calculated
by means of an expectation maximization (EM) algorithm. The implementation and
adaptation of an EM algorithm are described in this section.

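Here, an approach which is similar to forward backward training of
HMMs is followed. The general equation for EM based grammar learning is given by:

$$p(k, w) = \frac{\sum_{K} p(K \mid W) \cdot N_K(k, w)}{\sum_{K} p(K \mid W) \cdot \sum_{k', w'} N_K(k', w')} \qquad (1)$$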

where W is a sequence of phrases, K is a tag sequence, w is a phrase,
k is a semantic tag, N_K(k, w) is the number of times that k and w occur
together for a given W and K, and p(K | W) gives the probability that a sequence of
phrases W is mapped to a tag sequence K.

This approach assumes that the number of tags s equals the number of
phrases. The numerator of equation (1),

$$\sum_{K} p(K \mid W) \cdot N_K(k, w),$$

adds for each tag sequence K the probability p(K | W) as many times
as the tag k is mapped to phrase w in this tag sequence. This may be rewritten as
follows:

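$$\sum_{K} p(K \mid W) \cdot N_K(k, w) = \sum_{K} \sum_{i} p(K \mid W) \cdot \delta(k_i, k) \cdot \delta(w_i, w) = \sum_{i:\, w_i = w} \underbrace{\sum_{K:\, k_i = k} p(K \mid W)}_{p(k_i = k \mid W)}$$

where δ(x, y) is the usual delta function, equal to 1 if x = y and 0 otherwise,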



WO 2005/048240 






PCT/IB2004/052352



and p(k_i = k | W) is the overall probability that the phrase w at position
i in the phrase string W is mapped to tag k. Similarly, for the denominator of Eq. (1) the
following holds:

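$$\sum_{K} p(K \mid W) \cdot \sum_{k', w'} N_K(k', w') = \sum_{k'} \sum_{w'} \sum_{K} p(K \mid W) \cdot N_K(k', w')$$

resulting in the estimation formula

$$p(k, w) = \frac{\sum_{i:\, w_i = w} p(k_i = k \mid W)}{\sum_{k'} \sum_{w'} \sum_{i:\, w_i = w'} p(k_i = k' \mid W)}.$$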

For the estimation over the whole corpus, numerator and denominator
must be separately computed and summed up for each corpus sentence.
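The probability p(k_i = k | W) that is central to Eq. (1) is the
probability of all tag sequences that have tag k for the phrase at position i. Before and
after position i, all remaining permutations of tags are possible. If κ is the unordered
list of tags and π(κ) the set of all possible permutations over κ, then

$$
\begin{aligned}
p(k_i = k \mid W) &= \sum_{K \in \pi(\kappa):\, k_i = k} p(K \mid W) \\
&= \sum_{K \in \pi(\kappa):\, k_i = k} \Big( \prod_{j=1}^{i-1} p(k_j \mid w_j) \Big) \cdot p(k \mid w_i) \cdot \Big( \prod_{j=i+1}^{s} p(k_j \mid w_j) \Big) \\
&= \sum_{\kappa' \subseteq \kappa \setminus \{k\}:\, |\kappa'| = i-1} L(i-1, \kappa') \cdot p(k \mid w_i) \cdot R(i+1, (\kappa \setminus \kappa') \setminus \{k\}).
\end{aligned}
$$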

L(i − 1, κ') is the probability for all permutations of the unordered tag
sublist κ' of length i − 1 over the left subsentence up to position i − 1, and
R(i + 1, (κ \ κ') \ {k}) is the probability for all permutations of the unordered
complement tag sublist (κ \ κ') \ {k} of length s − i over the right subsentence from
position i + 1. These values can be recursively computed:
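
$$
\begin{aligned}
L(i, \kappa') &= \sum_{K \in \pi(\kappa')} \prod_{j=1}^{i} p(k_j \mid w_j) \\
&= \sum_{k \in \kappa'} p(k \mid w_i) \sum_{K \in \pi(\kappa' \setminus \{k\})} \prod_{j=1}^{i-1} p(k_j \mid w_j) \\
&= \sum_{k \in \kappa'} p(k \mid w_i) \cdot L(i-1, \kappa' \setminus \{k\}). \qquad (3)
\end{aligned}
$$

Similarly,

$$R(i, \kappa') = \sum_{k \in \kappa'} p(k \mid w_i) \cdot R(i+1, \kappa' \setminus \{k\}). \qquad (4)$$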

Storing and re-using the values L(i, κ') and R(i, κ') in Eqs. (3) and (4)
reduces computational costs. For a given i, there are $\binom{|\kappa|}{i}$ unordered tag
lists κ' and thus

$$\sum_{i=1}^{|\kappa|} \binom{|\kappa|}{i} \cdot i$$

operations to perform to fully compute the table L (the same holds for
table R). However, no closed form or good estimation for this has been found, so it is
not clear whether the computation is efficient in the sense that it has a polynomial
computing time.

The implementation of the EM algorithm follows directly from
the above mentioned expressions. The implementation is further described by Figs. 2
and 3 for one iteration. Some notes about the implementation follow:

For technical reasons, each element of the unordered tag list κ gets a
unique index in the range from 1 to |κ|. An unordered sublist κ' of length i is
represented as an i-dimensional vector whose scalar elements are the indexes of the
elements from κ that participate in κ'. This vector is incremented
to successively obtain all unordered sublists of length i.

The access to L(i, κ') for some unordered sublist κ' of length i is realized by
computing an index a with L(i, κ') = L(a) from the vector representation of κ':

$$a = \sum_{j=1}^{i} 2^{a_j - 1},$$

where a_j is the j-th element of the vector representation of κ'. The addition or
removal of a tag to or from κ' is reflected in the index of the tag. The index β of the
complement unordered list of tags needed for accessing R(i, (κ \ κ') \ {k}) = R(β) is
easily computed by

$$\beta = 2^{|\kappa|} - 1 - a - 2^{a_k - 1},$$

where a_k denotes the index of the tag k in κ.

For faster computation, there is a table whose j-th entry contains the
value 2^j.
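
The index computation can be illustrated with the following Python sketch, in
which a tag sublist is encoded as a bit mask. The function names sublist_index and
complement_index are hypothetical; the 1-based tag indexes follow the convention
stated above.

```python
def sublist_index(positions):
    """a = sum over the 1-based tag indexes a_j of 2**(a_j - 1)."""
    return sum(1 << (a_j - 1) for a_j in positions)

def complement_index(n_tags, a, tag_index):
    """beta = 2**n_tags - 1 - a - 2**(tag_index - 1).

    Index of the complement sublist once the sublist with index `a` and the
    single tag with the given 1-based index have both been removed.
    """
    return (1 << n_tags) - 1 - a - (1 << (tag_index - 1))

# Tag list with 4 tags; the sublist contains tags 1 and 3, the current tag is tag 2.
a = sublist_index([1, 3])
beta = complement_index(4, a, 2)
print(a, beta)
```

In this example only tag 4 remains in the complement, so beta coincides with the
index of the one-element sublist containing tag 4.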

The dynamic programming computation of the list R is performed by
calling the subroutine that uses dynamic programming to compute the list L with a list
of phrases W' whose phrase order is reversed, i.e. w'_i = w_{s−i+1}.

Sentences with an unequal number of tags and phrases are discarded.

The initial probabilities p(k, w) are read in from a file and p(w) is
computed as marginal for p(k | w). The file simply lists k, w, and p(k, w) in one
ASCII line. The estimated probabilities p(k, w) are written out in the same format
and thus serve as input for the next iteration.
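
A possible reading and writing routine for such a probability file is sketched
below. The text only states that k, w and p(k, w) are listed in one ASCII line;
the tab separator, the function names and the example values are assumptions made
for this illustration.

```python
import io
from collections import defaultdict

def read_probs(lines):
    """Parse lines of the assumed form 'k<TAB>w<TAB>p(k, w)' into a dictionary."""
    probs = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        k, w, value = line.split("\t")
        probs[(k, w)] = float(value)
    return probs

def write_probs(probs, out):
    """Write the probabilities in the same assumed format, one entry per line."""
    for (k, w), value in sorted(probs.items()):
        out.write(f"{k}\t{w}\t{value!r}\n")

def conditional_probs(probs):
    """Compute p(w) as marginal of p(k, w) and return p(k | w) = p(k, w) / p(w)."""
    p_w = defaultdict(float)
    for (_, w), value in probs.items():
        p_w[w] += value
    return {(k, w): value / p_w[w] for (k, w), value in probs.items()}

# Hypothetical joint probabilities, written out and read back in.
joint = {
    ("origin", "from Hamburg"): 0.3,
    ("destination", "from Hamburg"): 0.1,
    ("destination", "to Munich"): 0.6,
}
buffer = io.StringIO()
write_probs(joint, buffer)
restored = read_probs(buffer.getvalue().splitlines())
cond = conditional_probs(restored)
print(cond[("origin", "from Hamburg")])
```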
Figure 2 illustrates a flow chart for iteratively calculating the probability
L(i, κ') for all permutations of the unordered tag sublist κ' of length i over the left
subsentence up to position i.

Initially, in step 200 the probability L(0,{}) is set to unity, before the 
index i is set to i = 1 in step 202. 

In step 204, a loop starts and each unordered sublist κ' of length i is
selected. In the following step 206, the probability L(i, κ') for each selected
unordered sublist is set to zero before, in the next step 208, each tag k which is an
element of the unordered sublist is selected. In step 210 finally, the probability L(i, κ')
is calculated according to:

$$L(i, \kappa') = L(i, \kappa') + L(i-1, \kappa' \setminus \{k\}) \cdot p(k \mid w_i).$$
In step 212 it is checked whether the index i is smaller than or equal to |W|, the
length of the phrase list W. If i ≤ |W| in step 212, then i is incremented by one, and
the procedure returns to step 204. When in contrast i > |W|, the procedure stops in
step 214.

The calculation of the probability for all permutations of the unordered 
complement tag sublist of the right subsentence from position i + 1 is performed 
correspondingly. 

Figure 3 is illustrative of a flow chart for calculating a mapping
probability p(k, w) on the basis of the EM algorithm. In step 300, for all tags k and
phrases w the probability p(k | w) is initialized by setting q = 0 and setting
q(k, w) = 0, before in step 302 one of the training corpus sentences is selected. Since
every sentence of the training corpus is taken into account for the grammar learning, the
following step 304 has to be applied to all sentences of the training corpus.

After a sentence of the training corpus has been selected in step 302, it is
further processed in step 304, in which the steps 306, 308, 310, and 312 are
successively applied. In step 306, an unordered tag list κ as well as an ordered phrase
list W are selected. In the next step 308, the dynamic programming construction of the
table L is performed as described in figure 2. After that, a similar procedure is
performed with the reversed table R in step 310.

The calculated tables as well as the initialized probabilities are further
processed in step 312. Step 312 can be interpreted as a nested loop with an index
i = 1, ..., i ≤ |W|. For each i, step 314 is performed, initializing another loop for each of
the unordered sublists κ' of length i − 1. For each unordered sublist the step 316 is
performed, selecting each tag k ∉ κ' and performing the following calculation in step
318:

$$q' = L(i-1, \kappa') \cdot p(k \mid w_i) \cdot R(i+1, (\kappa \setminus \kappa') \setminus \{k\}),$$

where q' is further processed in step 320 according to:

$$q(k, w_i) = q(k, w_i) + q' \quad \text{and} \quad q = q + q'.$$

When the steps 318 and 320 have been executed for each tag k ∉ κ' in
step 316, when step 316 has been performed for each unordered sublist of length i − 1
in step 314, when step 314 has been performed for each index i ≤ |W| in step 312, and
when finally the entire procedure given by step 312 has been performed for each
sentence of the training corpus, then in step 322 the mapping probability is determined
according to:



